Association between biochemical and hematologic factors with COVID-19 using data mining methods

Background and aim Coronavirus disease (COVID-19) is an infectious disease that can spread very rapidly, with important public health impacts. Predicting the important factors related to a patient's infectious disease is helpful to health care workers. The aim of this research was to select the critical features in the relationship between demographic, biochemical, and hematological characteristics in patients with and without COVID-19 infection. Method A total of 13,170 participants in the age range of 35–65 years were recruited. Decision Tree (DT), Logistic Regression (LR), and Bootstrap Forest (BF) techniques were fitted to the data. Three models were considered in this study: Model I used the biochemical features, Model II the hematological features, and Model III both the biochemical and hematological features. Results In Model I, the BF, DT, and LR algorithms identified creatine phosphokinase (CPK), blood urea nitrogen (BUN), fasting blood glucose (FBG), total bilirubin, body mass index (BMI), sex, and age as important predictors for COVID-19. In Model II, our BF, DT, and LR algorithms identified BMI, sex, mean platelet volume (MPV), and age as important predictors. In Model III, our BF, DT, and LR algorithms identified CPK, BMI, MPV, BUN, FBG, sex, creatinine (Cr), age, and total bilirubin as important predictors. Conclusion The proposed BF, DT, and LR models appear to be able to predict and classify infected and non-infected people based on CPK, BUN, BMI, MPV, FBG, sex, Cr, and age, which had a high association with COVID-19.


Introduction
The global number of new cases of Coronavirus Disease 2019 (COVID-19) continues to rise, and the world's agencies, institutions, and governments are still working towards identifying individuals who are at greatest risk of infection [1]. Identification of these predictive factors will make it possible to optimize the allocation of human and technical resources for management [2,3]. In addition, such predictors would allow interventional studies to be designed to target patients at risk of worsening and progression to death [4].
Studies have shown that certain demographic factors are related to the severity of COVID-19 [2,5,6]. Among these, older age is an important predictor of mortality, and male sex is a parameter in the proposed clinical severity risk scores [7]. Pre-existing conditions, such as diabetes mellitus, obesity, cardiovascular disease, hypertension (HTN), chronic lung diseases (particularly COPD), chronic kidney disease, immunosuppression, and sickle cell disease, predispose patients to an adverse clinical course and elevated risk of intubation and death [8].
Regarding laboratory tests, studies have reported laboratory parameters that may predict COVID-19 prognosis [9]. Findings commonly related to poor outcomes include increased lactate dehydrogenase (LDH), C-reactive protein (CRP), D-dimer, and high-sensitivity cardiac troponin I levels [10].
More knowledge of the specific symptoms and risk determinants of COVID-19 in different clinical settings is needed to properly treat these patients and to avoid disease complications [7,11]. Thus, this study was conducted to assess and analyze treatment, laboratory, and hospital results and the clinical and hematological features of COVID-19 patients at a Khorasan Razavi Health Center, Iran. The purpose of the current study was therefore to provide an overview of the relationship between COVID-19 and demographic, biochemical, and hematological features by applying machine learning algorithms, in order to better understand the situation, improve the treatment and management of the disease in the future, and present a picture of the disease burden in Iran.
In many areas of medicine, machine learning techniques have been useful for prediction and classification. In machine learning, the two primary task categories are "supervised" and "unsupervised" [12]. The decision tree (DT) is a supervised machine learning algorithm used in medical applications [13][14][15][16]. Because traditional statistical techniques make it difficult to choose predictors, we applied data mining techniques such as DT to identify the biochemical and hematologic measurements most closely associated with COVID-19. In medicine, public health, and related fields, logistic regression (LR) is applied to estimate the association between one or more independent (predictor) variables and a binary dependent (outcome) variable [17][18][19].
The Bootstrap Forest (BF) platform fits an ensemble model by averaging several DTs, each of which is fit to a bootstrap sample of the training data. Each split in each tree considers a random subset of the predictors.
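The two sources of randomness described for BF, and the averaging step, can be illustrated with a minimal sketch. This is not the JMP implementation: the function names, the toy row indices, the short predictor list, and the three per-tree probabilities are all hypothetical and serve only to show the mechanics.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one bootstrap sample for one tree."""
    return [rng.choice(rows) for _ in rows]

def candidate_predictors(predictors, k, rng):
    """Pick k distinct predictors to be considered at a single split."""
    return rng.sample(predictors, k)

def average_prediction(tree_probabilities):
    """Ensemble output: the mean of the per-tree predicted probabilities."""
    return sum(tree_probabilities) / len(tree_probabilities)

rng = random.Random(0)                                   # fixed seed for repeatability
rows = list(range(10))                                   # hypothetical row indices
predictors = ["CPK", "BUN", "FBG", "BMI", "age", "sex"]  # illustrative subset only

sample = bootstrap_sample(rows, rng)                 # some rows repeat, some are left out
subset = candidate_predictors(predictors, 4, rng)    # 4 terms sampled for one split
forest_probability = average_prediction([0.8, 0.6, 0.7])  # hypothetical tree outputs

print(sample, subset, round(forest_probability, 3))
```

Because each tree sees a different resample and each split a different subset of predictors, the trees are decorrelated, which is what makes averaging them worthwhile.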

Study population
This study was conducted on a population of 13,170 participants in the age range of 35-65 years, including 5780 subjects with severe acute respiratory syndrome coronavirus 2 (SARS-COV-2) infection and 7390 subjects without SARS-COV-2, from the MASHAD cohort study (Phase I), as previously described [20]. The Ethics Committee of the Mashhad University of Medical Sciences reviewed and approved the informed consent form, study protocol, and other study-related documents. All participants provided written informed consent.

Blood sampling
According to a standard protocol, blood samples were collected from an antecubital vein of each participant in a sitting position, following 12-14 h of overnight fasting, between 8 and 10 am. The details of laboratory measurements and cut-offs are explained in the baseline report of the MASHAD cohort study, as described previously [20].

Demographic data
Health care professionals and a nurse gathered demographic characteristics (e.g., age, sex, and smoking status) from participants by interview.

Anthropometric assessments
Anthropometric parameters, including weight, height, body mass index (BMI), and waist circumference, were measured in all subjects according to standardized protocols [20].

Diagnosis of COVID-19
Data on the diagnosis of COVID-19 were obtained from the SINA Healthcare System, which records the electronic health profiles of patients in hospitals and health centers in Mashhad, Iran. Data collection ran from the onset of the disease to the end of March 2021. Diagnosis of the disease was confirmed using a lung spiral computerized tomography (CT) scan and/or polymerase chain reaction (PCR) laboratory test. The flow chart of this study is given in Fig. 1.

Statistical analysis and model building
SAS JMP Pro version 13 (SAS Institute Inc., Cary, NC) and SPSS version 22 (Armonk, NY: IBM Corp.) were used to analyze the data. Chi-square and Fisher's exact tests were applied to measure the association between categorical variables. The independent-samples t-test was used to compare means between groups.
The dataset in this study was unbalanced (Cov+ compared to Cov-). Thus, a Synthetic Minority Oversampling Technique (SMOTE) algorithm was used with the LR, DT, and BF algorithms to transform the unbalanced data set into a balanced one [21,22]. Based on the SMOTE algorithm, sampling was done from 10 observations so that 8 or 9 cases of disease and a maximum of 2 cases of non-disease were selected. In each step, the samples were repeated based on the posterior distribution function. These steps were continued until the number of cases of the disease was very close to that of the other category, i.e., non-infection.
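For readers unfamiliar with SMOTE, the following is a minimal sketch of the textbook procedure: each synthetic case is placed at a random position on the segment between a minority-class case and one of its nearest minority-class neighbours. It is an illustration under stated assumptions, not the authors' exact sampling scheme; the function name, the two-feature toy data, and the parameter values are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation.

    Assumes `minority` holds at least two points (tuples of floats).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class cases described by two standardized features
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6, k=2)
balanced_minority = minority + new_points
print(len(balanced_minority))
```

In practice the number of synthetic points is chosen so that the two classes end up with nearly equal counts, which matches the stopping criterion described above.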
LR is a statistical model that is used to model dichotomous targets and to infer the effect of explanatory variables on the dichotomous target variable [23,24]. The main advantage of the LR algorithm is that it gives a clear indication of whether each input (explanatory variable) is directly or inversely associated with the target.
To evaluate and compare the performance of the LR, DT, and BF algorithms, we report the confusion matrix and the derived indices (accuracy, sensitivity, precision, and area under the curve (AUC) of the receiver operating characteristic (ROC) curve) for each algorithm and each model on the training data.
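As a reminder of how these indices are defined, the sketch below computes them from the four cells of a confusion matrix, and computes the AUC as the proportion of (infected, non-infected) pairs in which the infected case receives the higher predicted score. The counts and scores are hypothetical and are not taken from Table 5; the function names are illustrative.

```python
def classification_indices(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical counts: 40 true positives, 10 false positives,
# 35 true negatives, 15 false negatives
indices = classification_indices(tp=40, fp=10, tn=35, fn=15)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(indices, auc)
```

With these toy numbers the accuracy is 0.75 and the precision is 0.80, while the AUC is 8/9 because one infected case (score 0.4) is ranked below one non-infected case (score 0.7).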

Main findings
We have attempted to use the LR, DT, and BF models to diagnose COVID-

Discussion
This retrospective cohort study compared 5780 participants infected with COVID-19 and 7390 subjects without COVID-19 from Mashhad, Iran, in terms of baseline profiles, clinical features, and outcomes. We investigated the relationship of COVID-19 with demographic factors (sex, age, BMI, SBP, DBP, and smoking status), biochemical features (BUN, serum zinc, copper, Cr, triglyceride, cholesterol, FBG, hs-CRP, phosphorus, LDL-C, HDL-C, Gamma-GT, CPK, direct bilirubin, calcium, total bilirubin, AST, ALT, ALP, uric acid, and magnesium), and hematologic features (WBC, RBC, hemoglobin, hematocrit, MCV, MCH, MCHC, RDW, PDW, and MPV) through DT, BF, and LR algorithms, to obtain the related parameters and the best predicting factors. We proposed three models: Model I assessed the association between COVID-19 and biochemical features, Model II the association between COVID-19 and hematologic features, and Model III the association between COVID-19 and both biochemical and hematologic features. In Model I, our BF, DT, and LR algorithms identified CPK, BUN, FBG, BMI, total bilirubin, sex, and age as important predictors. In Model II, our BF, DT, and LR algorithms identified BMI, sex, MPV, and age as important predictors. Finally, in Model III, our BF, DT, and LR algorithms identified CPK, BMI, MPV, BUN, FBG, sex, Cr, age, and total bilirubin as important predictors.
[...] decreased, but the levels of hemoglobin, RBC, and GRAN% increased in patients with COVID-19 [26]. This suggests that hematological parameters have an important role in prognosis.
SARS-COV-2 has a high transmission potential, especially in the elderly and those with underlying diseases [7]. Numerous studies have attempted to show the COVID-19 incidence in people with metabolic disorders, especially diabetics, who are prone to COVID-19 due to a compromised immune system [27][28][29]. Diabetes is one of the most frequent underlying comorbidities in patients with COVID-19, according to recent reports, and it is related to prevalence and mortality in these patients [30,31]. The present study makes several noteworthy contributions to identifying the critical features of the relationship between demographic, biochemical, and hematological characteristics in patients with and without COVID-19 infection by data mining approaches. In the same vein, a data mining study by Marhl et al. aimed to deduce the physiological roots of clinical findings relating diabetes to the severity and adverse effects of SARS-COV-2. They also suggested clinical biomarkers that could predict a higher risk, such as HTN, elevated serum alanine aminotransferase, high interleukin-6, and a low lymphocyte count [32][33][34].
The results of some studies consistently indicated a high incidence of diabetes in SARS-COV-2 patients (24.9%) and a statistically significant difference between hospitalized SARS-COV-2 patients with and without diabetes [31,35]. The most striking result to emerge from the data is that serum levels of FBG were significantly different between the case and control groups. Also, as DT and BF showed, elevated serum FBG was associated with a significantly increased risk of COVID-19.
Furthermore, there was a significant difference in LDL-C levels between the case and control groups. Similarly, Wei et al. found that LDL-C levels in SARS-COV-2 patients were slightly lower than in healthy participants [36].
According to data from China, while men and women have the same prevalence of SARS-COV-2, infected men were more likely to die than women [37,38]. Here, all models illustrated that the incidence of COVID-19 was higher in men.
There was an association between smoking and COVID-19, a relationship that has also been examined in recent meta-analyses [39][40][41]. In fact, the obtained results showed that the incidence of COVID-19 was higher in smokers.
In our LR algorithm in Model I, SBP and DBP were significantly associated with COVID-19 and increased its incidence. According to Schiffrin et al. (2020), it is uncertain whether uncontrolled HTN is a risk factor for SARS-COV-2 infection [42], whereas Pranata et al. reported that HTN was associated with a higher risk of death, severe COVID-19, acute respiratory distress syndrome (ARDS), intensive care unit (ICU) admission, and disease progression in COVID-19 patients [43]. High SBP is a source of end-organ damage and a significant comorbid factor, according to a report published in 2021 [44].
In this study, we identified an association between SARS-COV-2 and components of dyslipidemia such as cholesterol, triglycerides, and HDL-C. In fact, the LR algorithm showed that HDL-C decreased the incidence of infection. As stated by Hariyanto et al., dyslipidemia increases the risk of experiencing serious outcomes from SARS-COV-2 infections [45]. In 2020, several studies described the correlation between lipid profile and COVID-19. Hua et al. found that serum HDL-C concentrations decreased significantly in the early stages of SARS-COV-2 infection [46], and Wei Ye et al. found a substantial decrease in cholesterol levels in the serum of COVID-19 patients [36]. This result may be explained by the fact that baseline HDL-C, LDL-C, triglyceride, and cholesterol levels in our study differed significantly between the studied groups.
Based on the findings of Zhu et al., positive chest CT scans of COVID-19 patients were correlated with CRP levels; CRP levels rose in the majority of serious and critical cases and were associated with their prognosis [47]. In this study, there was likewise a relationship between hs-CRP levels and SARS-COV-2.
According to published results, hospitalized patients with COVID-19 infection had impaired liver function. Their liver inflammatory markers, including AST, ALT, ALP, total bilirubin, and Gamma-GT, were elevated [48][49][50]. In the majority of cases, the results of this study confirm the previous research.
Electrolyte balance and adequate mineral and vitamin intake are the main parameters that affect disease progression. Because they affect the immune system, electrolyte imbalance and lack of trace elements or vitamins raise the risk of serious infection [51]. Iron, magnesium, uric acid, calcium, and BUN were investigated in the current research and were found to be associated with SARS-COV-2.
A limitation of this study is that the number of patients was relatively small. The current research was not specifically designed to evaluate anthropometric parameters and nutritional questionnaires. It is suggested that the association of these factors be investigated in future studies.

Conclusion
This project was undertaken to design and evaluate biochemical and hematological assessment in the MASHAD cohort study and compare these between COVID-19 infected patients and non-infected subjects. Our DT and BF models appear to be able to predict and classify infected and non-infected people based on biochemical and hematologic factors that were associated with SARS-COV-2.

Fig. 1 Flow chart of this study

Fig. 2 Fig. 3
Fig. 2 Graphical representation of the classification tree introduced for SARS-COV-2 diagnosis for Model I

Fig. 4 Graphical representation of the classification tree introduced for SARS-COV-2 diagnosis for Model III

Table 1
Summary of the demographic characteristics of this study
Abbreviations: ALT Alanine aminotransferase, Cr Creatinine, BMI Body mass index, DBP Diastolic blood pressure, SBP Systolic blood pressure, BUN Blood urea nitrogen, FBG Fasting blood glucose, Gamma-GT Gamma glutamyl transferase, CPK Creatine phosphokinase, ALP Alkaline phosphatase, WBC White blood cells, RBC Red blood cells, MCV Mean corpuscular volume, MCH Mean corpuscular hemoglobin, MCHC Mean corpuscular hemoglobin concentration, RDW Red cell distribution width, PDW Platelet distribution width, MPV Mean platelet volume
In the training phase of DT, the important variables were selected and the final tree was obtained after pruning. Models I, II, and III were run with 17, 8, and 18 input variables, respectively. In Model I, CPK, age, BUN, BMI, ALP, sex, total bilirubin, hs-CRP, FBG, and Gamma-GT; in Model II, age, MPV, sex, BMI, hemoglobin, and MCHC; and in Model III, CPK, Cr, BUN, BMI, FBG, age, MPV, MCHC, sex, and total bilirubin remained in the models. Based on Table 5, the trees built on the biochemical, hematologic, and combined variables (Model I, Model II, and Model III, respectively) had 73.24%, 70.53%, and 68.80% accuracy on the training data. The other performance indices are given in Table 5 (b), (e), and (h). The rules from the DTs for Models I, II, and III are shown in Table 6. Rule 1 in Model I illustrated that in a subgroup with CPK >= 114.09 & BUN >= 30.00 & BMI >= 26.77 & Age >= 54.00 & Gamma-GT >= 16.91, the probability of being Cov+ was 84.69%. In another subgroup, CPK < 114.09 & CPK < 88.06 & Sex(female) & ALT < 9.00 led to a 6.57% chance of being Cov+. The rules from Model II illustrated that there was an 86.46% chance that participants with Age >= 54.00 & BMI >= 26.77 & MPV >= 9.60 & Sex(male) & Hemoglobin < 15.8 were infected with COVID-19.
In Model I, BMI, BUN, and age were defined as the most crucial variables, with high ORs, by the LR algorithm. With a one-unit increase in BMI, the odds of being Cov+ increased 1.092-fold. With a one-year increase in age, the odds of being Cov+ increased 1.048-fold, and with a one-unit increase in BUN, the odds of being Cov+ increased 1.041-fold (see Table 2). In Model II, BMI, age, hemoglobin, hematocrit, sex, MPV, smoking status, and MCHC were significant (P-value < 0.05). Hemoglobin had an OR of 4.292, so each unit increase multiplied the odds of being Cov+ by 4.292. MPV had an OR of 1.550, so each unit increase multiplied the odds of being Cov+ by 1.550. Table 3 shows the other variables and effect sizes. In Model III, CPK, BMI, MPV, FBG, sex, BUN, Cr, iron, magnesium, total bilirubin, hemoglobin, hematocrit, MCHC, smoking status, age, WBC, HDL-C, and ALT were associated with COVID-19 status (P-value < 0.05). Total bilirubin and MPV had ORs of 1.647 and 1.447, so the odds of being Cov+ were multiplied by 1.647 and 1.447, respectively (see Table 4). Based on Table 5, the accuracies of the LR algorithm for the three models (Model I, II, and III) were 75.13%, 68.28%, and 69.63%, respectively. The other performance indices are given in Table 5 (a), (d), and (g).
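Because odds ratios act multiplicatively on the odds, the per-unit ORs reported for Model I can be compounded over larger changes. The sketch below shows this arithmetic for the reported BMI and age ORs; the 5-unit and 10-year changes and the 30% baseline probability are hypothetical values chosen only for illustration, and the function names are not from the original analysis.

```python
import math

def odds_ratio_for_change(or_per_unit, delta):
    """OR for a change of `delta` units, given the OR for a one-unit change."""
    return or_per_unit ** delta

def coefficient_from_or(or_per_unit):
    """Logistic regression coefficient (log-odds per unit) implied by an OR."""
    return math.log(or_per_unit)

def probability_after(baseline_probability, odds_ratio):
    """Apply an OR to a baseline probability by converting to odds and back."""
    odds = baseline_probability / (1 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

or_bmi_5 = odds_ratio_for_change(1.092, 5)    # BMI higher by 5 units (hypothetical change)
or_age_10 = odds_ratio_for_change(1.048, 10)  # age higher by 10 years (hypothetical change)
beta_bmi = coefficient_from_or(1.092)         # implied log-odds coefficient for BMI
p_new = probability_after(0.30, or_bmi_5)     # 0.30 is an assumed baseline probability

print(round(or_bmi_5, 3), round(or_age_10, 3), round(beta_bmi, 4), round(p_new, 3))
```

Note that an OR multiplies the odds, not the probability: with the assumed 30% baseline, a 5-unit higher BMI raises the predicted probability to roughly 40%, not to 1.55 times 30%.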

Table 2
The results of LR algorithms for Model I
Abbreviations: HDL-C High density lipoprotein cholesterol, hs-CRP High-sensitivity C-reactive protein, AST Aspartate aminotransferase, ALT Alanine aminotransferase, BMI Body mass index, DBP Diastolic blood pressure, SBP Systolic blood pressure, BUN Blood urea nitrogen, FBG Fasting blood glucose, Gamma-GT Gamma glutamyl transferase, CPK Creatine phosphokinase
Another rule suggested that the probability of Cov+ in individuals with CPK < 114.09 & Cr < 1.40 & Cr < 1.00 & FBG < 118.34 & Sex(female) was 9.90%. Other rules are stated in Table 6. Hence, CPK and BUN for Model I; age, BMI, and MPV for Model II; and CPK and BUN for Model III were defined as the most crucial variables. The final DTs are shown in Figs. 2, 3, and 4.
In the final step, we applied BF to analyze the data with respect to COVID-19 status. The BF algorithm included 17, 8, and 18 variables for Model I, II, and III, respectively. Moreover, we set the following specifications: Number of Trees in the Forest: 29 for Model I, 13 for Model II, and 53 for Model III; Number of Terms Sampled per Split: 4 for Model I, 2 for Model II, and 4 for Model III; Training Rows: 10,536; Test Rows: 2634; Minimum Splits per Tree: 10; and Minimum Size Split: 13 for all three models. The confusion matrices and evaluation indices for comparison of Models I, II, and III are given in Table 5 (c), (f), and (i). Additionally, the crucial variables related to COVID-19 based on the BF algorithm were CPK, BUN, FBG, BMI, total bilirubin, and age in Model I; BMI, sex, MPV, and age in Model II; and CPK, Cr, FBG, BMI, BUN, total bilirubin, sex, MPV, and age in Model III. The features obtained from the BF algorithm were consistent with those obtained from the LR and DT algorithms.
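The BF specifications listed above can be collected into a small configuration for reference, which also makes the implied train/test proportion explicit. The dictionary keys below are hypothetical labels, not JMP option names, and the row counts and split limits are read here as shared by all three models.

```python
# Reported Bootstrap Forest settings per model (key names are illustrative)
BF_SETTINGS = {
    "Model I": {"n_trees": 29, "terms_per_split": 4, "n_inputs": 17},
    "Model II": {"n_trees": 13, "terms_per_split": 2, "n_inputs": 8},
    "Model III": {"n_trees": 53, "terms_per_split": 4, "n_inputs": 18},
}

# Settings assumed to be common to the three models
SHARED = {
    "training_rows": 10536,
    "test_rows": 2634,
    "min_splits_per_tree": 10,
    "min_size_split": 13,
}

total_rows = SHARED["training_rows"] + SHARED["test_rows"]
train_fraction = SHARED["training_rows"] / total_rows

# Share of the input variables that is sampled at each split
sampled_share = {
    name: cfg["terms_per_split"] / cfg["n_inputs"]
    for name, cfg in BF_SETTINGS.items()
}

print(total_rows, train_fraction, sampled_share)
```

The row counts sum to the 13,170 participants of the study and correspond to an 80/20 split, and in each model roughly a quarter of the input variables are sampled at every split.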

Table 3
The results of LR algorithms for Model II

Table 4
The results of LR algorithms for Model III
* Significant at error level 0.05
Abbreviations: ALT Alanine aminotransferase, Cr Creatinine, BMI Body mass index, BUN Blood urea nitrogen, FBG Fasting blood glucose, CPK Creatine phosphokinase, WBC White blood cells, MCHC Mean corpuscular hemoglobin concentration, MPV Mean platelet volume
The DT with 5 layers identified the various risk factors for SARS-COV-2. Based on our results, in the subgroup with Age ≥ 54, BMI ≥ 26.7, MPV ≥ 9.6, and hemoglobin < 15.8, eighty-six percent of subjects
LR algorithms illustrated that CPK, BMI, MPV, BUN, FBG, sex, Cr, age, and total bilirubin as important predictors. This paper attempts to show that graphical representation of the classification tree for hematologic factors (Model II).

Table 5
Model performance indices of the LR, DT, BF algorithms for Model I, II, and III in training data

Table 6
Extracted rules of the DT algorithms for Model I, II, and III
Abbreviations: hs-CRP High-sensitivity C-reactive protein, ALT Alanine aminotransferase, Cr Creatinine, BMI Body mass index, BUN Blood urea nitrogen, FBG Fasting blood glucose, Gamma-GT Gamma glutamyl transferase, CPK Creatine phosphokinase, MCV Mean corpuscular volume, MCHC Mean corpuscular hemoglobin concentration, MPV Mean platelet volume, Num Number of rules